An Introduction to Data Privacy in Practice

Torus Talk - MacEwan University, March 2026

Katie Burak

About Me

Katie Burak
Assistant Professor of Teaching, Department of Statistics, UBC https://katieburak.github.io/

https://katieburak.github.io/torus-talk-2026/

What Are You Comfortable Sharing?

  • Your favorite type of music
  • Your Instagram likes and follows
  • Your e-mail
  • Your name and DOB
  • Your GPS location throughout the day
  • Your browsing history
  • Your private messages/DMs

Discussion:

  1. Which of these data would you feel comfortable sharing with an app?
  2. What questions would you want to ask before sharing these data?
  3. What if the app combined two or three of these pieces of information?

Source: CBC News

  • Even something as simple as your Facebook “likes” can reveal a lot more than you think…
  • Researchers at Cambridge showed that algorithms could predict:
    • Sexual orientation with up to 88% accuracy
    • Race with 95% accuracy
    • Political affiliation with 85% accuracy
  • All from analyzing the pages and posts you “liked” (no profile bio or messages needed)!

https://www.cam.ac.uk/research/news/digital-records-could-expose-intimate-details-and-personality-traits-of-millions

What Happens to Your Data?

Every time you use an app, visit a website, click on a link, fill out a survey or even just scroll on your device, your data is being:

  • Collected - What you click, search, watch, like or buy
  • Analyzed - Used to predict your behaviour, interests or identity
  • Shared or Sold - Passed to advertisers, data brokers or other companies

Why Does This Matter?

  • You may be targeted with ads, content and potentially misinformation
  • You could be judged or profiled based on your data (even if it’s not accurate)
  • You rarely know who has your data (or what they’re doing with it)



  • So what does this mean for us? Let’s explore how data can be used, what makes certain information sensitive and why it matters.

Personally Identifiable Information (PII)

  • PII refers to any data that can be used to identify a specific individual.
  • Direct identifiers: These clearly and uniquely point to a person.
    • Examples: name, social security number, patient ID
  • Indirect identifiers: These don’t identify someone on their own, but could when combined.
    • Examples: age, DOB, postal code, race, sex

Personal Data

Data can be identifiable when:

  • They contain directly identifying information.
  • It’s possible to single out an individual.
  • It’s possible to infer information about an individual based on information in your dataset.
  • It’s possible to link records relating to an individual.
  • De-identification is still reversible.

Scenario: Can This Data Identify You?

A fitness app shares anonymized data with researchers. The dataset includes:

  • Step count per day
  • General location (postal code)
  • Age
  • Time of day the user exercises
  • Health conditions

Separately, a publicly available dataset includes information from a local running club: names, age groups and 5K race times.

The Mosaic Effect

  • The “Mosaic Effect” happens when separate pieces of data, which alone don’t identify anyone, are combined from different sources to reveal personal information or identify an individual.

  • A study published in 2000 estimated that 87% of the United States population could be uniquely identified by the combination of their 5-digit ZIP code, gender and date of birth.

https://dataprivacylab.org/projects/identifiability/paper1.pdf
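
To make the fitness-app scenario concrete, here is a minimal sketch of a linkage attack in base R. Every record, name and postal code prefix below is invented for illustration, and it is an assumption that the attacker knows the running club draws its members from one postal area.

```r
# "Anonymized" fitness data released to researchers (invented records)
fitness <- data.frame(
  postal_prefix = c("T5J", "T5J", "T6G", "T6G"),
  age           = c(24, 37, 24, 58),
  condition     = c("asthma", "none", "none", "diabetes")
)

# Public running-club results (invented records); the club is assumed
# to draw its members from the T6G area
club <- data.frame(
  name          = c("Runner A", "Runner B"),
  age_group     = c("20-29", "50-59"),
  postal_prefix = "T6G"
)

# Derive the same age grouping in the fitness data: [20, 30), [30, 40), ...
fitness$age_group <- as.character(cut(
  fitness$age,
  breaks = c(20, 30, 40, 50, 60),
  right  = FALSE,
  labels = c("20-29", "30-39", "40-49", "50-59")
))

# Link the two sources on the shared quasi-identifiers
linked <- merge(club, fitness, by = c("postal_prefix", "age_group"))
linked[, c("name", "age", "condition")]
```

Neither dataset pairs a name with a health condition, yet the merged result does for every club member who is unique on postal area and age group.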

Pseudonymization and Anonymization

  • Pseudonymization and anonymization are techniques to de-identify personal data
  • Goal: reduce linkability of data to individuals
  • We will now define each of these terms

Pseudonymization

  • Reduces linkability of data to individuals
  • Data cannot identify individuals without additional information
  • Often done by replacing direct identifiers with pseudonyms
  • Link between real identifiers and pseudonyms is stored separately
  • Re-identification remains possible!

Anonymization

  • Data are anonymized when no individual is identifiable (directly or indirectly)
  • This applies even to the data controller
  • Fully anonymized data are no longer personal data
  • Anonymization is difficult to achieve in practice

Identifiability Spectrum

  • Identifiability is a spectrum
  • More de-identified data = closer to anonymized
  • Lower identifiability = lower re-identification risk

https://www.kdnuggets.com/2020/08/anonymous-anonymized-data.html

When Are Data Truly Anonymous?

  • Only if re-identification would require unreasonable effort (factors include cost, time and available technology)
  • Data are not anonymous if:
    • Direct identifiers are present
    • Individuals can be singled out from a group
    • Re-identification is possible by linking datasets (mosaic effect)
    • Inference about identity is possible (e.g., through a combination of variables)
    • De-identification can be reversed

De-identification Techniques

Techniques to de-identify your data include:

  • Suppression
  • Generalization
  • Replacement
  • Top- and bottom-coding
  • Adding noise
  • Permutation

First, let’s generate some data we can use to help illustrate these concepts.

library(tidyverse)

df <- tibble(
  name = c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson"),
  age = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173)
)

df
# A tibble: 4 × 3
  name             age height_cm
  <chr>          <dbl>     <dbl>
1 Joel Miller       52       182
2 Ellie Williams    19       160
3 Tommy Miller      48       185
4 Abby Anderson     28       173

Suppression

  • Remove entire variables, values or records
  • Used to eliminate highly identifying or unnecessary data
  • Examples:
    • Names, contact details, social security numbers
    • GPS metadata, IP addresses, neuroimaging facial features
    • Outliers or unique participants

Suppression Example

df_suppressed <- df |>
  select(-name)

df_suppressed
# A tibble: 4 × 2
    age height_cm
  <dbl>     <dbl>
1    52       182
2    19       160
3    48       185
4    28       173

Generalization

  • Reduces detail or granularity in the data
  • Makes individuals harder to single out
  • Examples:
    • Convert date of birth to age, or group into ranges
    • Replace address with town or region
    • Recategorize rare labels into “other” or “missing”
    • Abstract people or places in qualitative data (e.g., “Bob” to “[colleague]”)

Here we will show an example of generalization on the age column:

df_generalized <- df |>
  mutate(age_group = case_when(
    age < 30 ~ "under 30",
    TRUE     ~ "30+"
  )) |>
  select(-age)

df_generalized
# A tibble: 4 × 3
  name           height_cm age_group
  <chr>              <dbl> <chr>    
1 Joel Miller          182 30+      
2 Ellie Williams       160 under 30 
3 Tommy Miller         185 30+      
4 Abby Anderson        173 under 30 

Replacement

  • Swap identifying info with less informative alternatives
  • Examples:
    • Use pseudonyms for names (with securely stored keyfile)
    • Replace with placeholders (e.g., “[redacted]”)
    • Rounding numeric values

Creating Pseudonyms

  • Pseudonyms should reveal nothing about the subject
  • Good pseudonyms:
    • Are random or meaningless strings/numbers
    • Are securely managed (e.g., encrypted keyfile)
  • Can be generated using tools in Excel, R, Python, SPSS

Replacement with Pseudonyms

df_pseudonymized <- df |>
  mutate(pseudonym = paste0("ID", row_number())) |>
  select(pseudonym, everything(), -name)

df_pseudonymized
# A tibble: 4 × 3
  pseudonym   age height_cm
  <chr>     <dbl>     <dbl>
1 ID1          52       182
2 ID2          19       160
3 ID3          48       185
4 ID4          28       173
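
Sequential IDs such as ID1, ID2 follow row order, so anyone who knows the original ordering can undo them. The following is a minimal sketch of random pseudonyms with a separate key table, written in base R. The `make_pseudonyms` helper, the `P` prefix and the 6-digit format are hypothetical choices for illustration, and in practice the key would be written to an encrypted file kept apart from the released data.

```r
set.seed(2026)  # reproducible illustration only; do not fix the seed in practice

names_vec <- c("Joel Miller", "Ellie Williams", "Tommy Miller", "Abby Anderson")

# Hypothetical helper: draw unique random 6-digit codes, one per person
make_pseudonyms <- function(n) {
  codes <- sample(100000:999999, size = n, replace = FALSE)
  paste0("P", codes)
}

# Key table linking real names to pseudonyms; store separately and securely
key <- data.frame(
  name      = names_vec,
  pseudonym = make_pseudonyms(length(names_vec))
)

# Released data: pseudonym plus the other variables, without the name
released <- data.frame(
  pseudonym = key$pseudonym,
  age       = c(52, 19, 48, 28),
  height_cm = c(182, 160, 185, 173)
)

released
```

Only someone holding `key` can map a pseudonym back to a name, which is why re-identification remains possible under pseudonymization.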

Hashing

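  • Hashing converts names into fixed-length strings using a one-way function.
  • Unlike pseudonyms, hashed values have no keyfile that maps them back, but hashes of guessable values (such as names) can still be matched by hashing likely candidates.
  • In R, we can use the digest() function from the digest package to hash (MD5 by default).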

library(digest) 

df_hashed <- df |>
  rowwise() |>
  mutate(name_hash = digest(name)) |>
  select(name_hash, everything(), -name)

df_hashed
# A tibble: 4 × 3
# Rowwise: 
  name_hash                          age height_cm
  <chr>                            <dbl>     <dbl>
1 4a3e0ee26ab3fb1338e893f4d4e7244b    52       182
2 201943dd66d423ed3cce2242a75736d4    19       160
3 81699ec9483bad176eed57ee43ffa010    48       185
4 046dff9ba9cf33573396f4de8c0c0e0b    28       173
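
The sketch below illustrates why hashing names on its own offers limited protection, and how a secret salt helps. It assumes the digest package is installed and uses SHA-256 with `serialize = FALSE` so that the hash depends only on the text. The `hash_name` helper, the candidate list and the salt value are made up for illustration.

```r
library(digest)

# Hypothetical helper: hash each string, optionally prefixed with a salt
hash_name <- function(x, salt = "") {
  vapply(paste0(salt, x), digest, character(1),
         algo = "sha256", serialize = FALSE, USE.NAMES = FALSE)
}

# What a data recipient would see in place of the name
released_hash <- hash_name("Ellie Williams")

# Dictionary attack: hash plausible names and look for a match
candidates <- c("Joel Miller", "Ellie Williams", "Abby Anderson")
candidates[hash_name(candidates) == released_hash]

# With a secret salt (stored like a keyfile), the same guesses no longer match
salted_hash <- hash_name("Ellie Williams", salt = "a-long-secret-value")
any(hash_name(candidates) == salted_hash)
```

The salt has to stay secret and be managed as carefully as a pseudonym keyfile, because anyone who learns it can repeat the attack.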

Top- and Bottom-Coding

  • Limits extreme values in quantitative data
  • Recode all values above or below a threshold
  • Example: all incomes above $150,000 become $150,000
  • Preserves much of the dataset, but distorts distribution tails

Top-Coding Example

  • Suppose 6 ft (182.88 cm) is our maximum height threshold.

df_top_coded <- df |>
  mutate(height_cm = if_else(height_cm > 182.88, 182.88, height_cm))

df_top_coded
# A tibble: 4 × 3
  name             age height_cm
  <chr>          <dbl>     <dbl>
1 Joel Miller       52      182 
2 Ellie Williams    19      160 
3 Tommy Miller      48      183.
4 Abby Anderson     28      173 
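
Bottom-coding works the same way at the lower tail. Here is a small sketch that applies both limits at once with base R's `pmin()` and `pmax()`. The 165 cm lower threshold is an arbitrary value chosen for illustration.

```r
height_cm <- c(182, 160, 185, 173)  # heights from the example data

lower <- 165      # assumed bottom-coding threshold (illustrative)
upper <- 182.88   # top-coding threshold of 6 ft

# pmax() raises values below `lower`; pmin() caps values above `upper`
height_coded <- pmin(pmax(height_cm, lower), upper)

height_coded
```

Values inside the two thresholds are left untouched, so only the tails of the distribution are distorted.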

Adding Noise

  • Introduces randomness to protect sensitive info
  • Examples:
    • Add a small random amount to numeric values
    • Blur images or alter voices

Adding Noise to Height

set.seed(200) 

df_noisy <- df |>
  mutate(height_cm_noisy = height_cm + rnorm(n(), mean = 0, sd = 2)) |>
  select(-height_cm)

df_noisy
# A tibble: 4 × 3
  name             age height_cm_noisy
  <chr>          <dbl>           <dbl>
1 Joel Miller       52            182.
2 Ellie Williams    19            160.
3 Tommy Miller      48            186.
4 Abby Anderson     28            174.

Permutation

  • Swap values between individuals
  • Makes linking variables across a record more difficult
  • Maintains distributions, but breaks correlations
  • Can limit the types of analyses possible

Permutation of Height Values

set.seed(200)

df_permuted <- df |>
  mutate(height_cm_permuted = sample(height_cm)) |>
  select(-height_cm)

df_permuted
# A tibble: 4 × 3
  name             age height_cm_permuted
  <chr>          <dbl>              <dbl>
1 Joel Miller       52                160
2 Ellie Williams    19                173
3 Tommy Miller      48                182
4 Abby Anderson     28                185
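
Here is a quick base R check of the claim that permutation keeps a variable's distribution while disturbing its relationship with other variables. It is only a sketch on the four example rows, so the correlations are indicative at best, and the seed is arbitrary.

```r
set.seed(1)

age       <- c(52, 19, 48, 28)
height_cm <- c(182, 160, 185, 173)

height_permuted <- sample(height_cm)

# Same set of values, so the marginal distribution is unchanged
identical(sort(height_permuted), sort(height_cm))

# The relationship with age is generally not preserved
cor(age, height_cm)
cor(age, height_permuted)
```

Summaries of height alone (mean, range, histogram) survive permutation, while any analysis relating height to age does not.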

Privacy vs. Utility Tradeoff

https://www.researchgate.net/figure/Trade-off-between-privacy-level-and-utility-level-of-data_fig1_357987903

Case Study: Brogan Inc. and NIHB Data

  • The Non-Insured Health Benefits (NIHB) database contains sensitive health data on First Nations use of services like prescriptions, dental care, and medical devices.
  • In 2001, Health Canada began releasing de-identified NIHB pharmacy claims data to Brogan Inc., a private health consulting firm.
  • Though personal identifiers were removed, community identifiers remained, and First Nations were not informed until 2007.
  • Brogan sold the data to pharmaceutical companies for commercial research and marketing.
  • Health Canada justified the release by claiming no privacy interests remained since personally identifying information had been removed.

Kukutai, T., & Taylor, J. (2016). Indigenous data sovereignty: Toward an agenda. ANU Press.

Discussion

  • Were the data truly de-identified?
  • What are the limits of simply removing names and IDs from a dataset?
  • How can we measure whether a dataset is truly “safe” to release?
  • Should de-identified data still require community consent before being shared or sold?

Why basic de-identification isn’t always enough

  • Individuals can often be re-identified using other information.

  • As datasets become more detailed and linkable, privacy risks increase.

  • More advanced statistical methods are often needed to ensure meaningful de-identification while preserving data utility.

Statistical approaches to de-identification



  • \(k\)-anonymity
  • \(l\)-diversity
  • Differential privacy (advanced)

Overview of privacy models

  • \(k\)-anonymity and \(l\)-diversity are statistical approaches that quantify the level of identifiability within a tabular dataset.
  • They focus on how combinations of variables can lead to identification.
  • These approaches are complementary: a dataset can be simultaneously \(k\)-anonymous and \(l\)-diverse, where \(k\) and \(l\) represent numeric thresholds.
  • \(k\)-anonymity and \(l\)-diversity are typically used to de-identify tabular datasets before sharing.
  • They work best on relatively large datasets, where enough observations are present to preserve useful detail while still protecting privacy.

Identifiers, Quasi-Identifiers, and Sensitive Attributes

Privacy models distinguish between three types of variables:

  • Identifiers: Direct identifiers such as names, student numbers and email addresses.
  • Quasi-Identifiers: Indirect identifiers that can lead to identification when combined with other quasi-identifiers or external data.
    • Examples: age, sex, place of residence, physical characteristics, timestamps, etc.
  • Sensitive Attributes: Variables of interest that need protection but cannot be altered because they are key outcomes.
    • Examples: medical condition, income, etc.

Importance of Correct Variable Categorization

  • Correctly categorizing variables into identifiers, quasi-identifiers, and sensitive attributes is crucial.
  • This categorization determines how to de-identify your dataset effectively using \(k\)-anonymity, \(l\)-diversity, and \(t\)-closeness.
  • Now, let’s discuss \(k\)-anonymity and \(l\)-diversity in detail…

\(k\)-anonymity

  • A dataset is \(k\)-anonymous if each observation cannot be distinguished from at least \(k-1\) other observations based on the quasi-identifiers.
  • This can be achieved through generalization, suppression and sometimes top- or bottom-coding of data values.
  • Applying \(k\)-anonymity makes it more difficult for an attacker to single out or re-identify specific individuals.
  • It also helps reduce the risk of the mosaic effect, where combining data points could lead to identification.

Making a dataset \(k\)-anonymous

  1. Identify variables as identifiers, quasi-identifiers and sensitive attributes.
  2. Choose a value for \(k\).
  3. Aggregate or transform the data so each combination of quasi-identifiers occurs at least \(k\) times.

Choosing \(k\)

  • There is no single correct value for \(k\)!
  • Higher \(k\) increases privacy, but reduces data detail and utility.
  • The choice depends on promises made to data subjects and acceptable risk levels.

Source: k2view.com

Example data

  • Age and city are quasi-identifiers, and salary is considered a sensitive attribute.

Age  City       Salary
38   Calgary    91,000
37   Toronto    92,000
31   Vancouver  82,000
48   Calgary    115,000
39   Vancouver  118,000
37   Calgary    97,000
34   Toronto    98,000
33   Vancouver  89,000
32   Toronto    108,000
45   Calgary    95,000

\(k=2\)

Age Range  City       Salary Range
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
40–49      Calgary    110,000–119,999
30–39      Vancouver  110,000–119,999
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
30–39      Toronto    100,000–109,999
40–49      Calgary    90,000–99,999
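
The generalization from the raw example data to the \(k=2\) table can be sketched in base R as follows. The `min_group_size` helper is a hypothetical name. Only the quasi-identifiers (age and city) enter the check, since salary is the sensitive attribute.

```r
raw <- data.frame(
  age    = c(38, 37, 31, 48, 39, 37, 34, 33, 32, 45),
  city   = c("Calgary", "Toronto", "Vancouver", "Calgary", "Vancouver",
             "Calgary", "Toronto", "Vancouver", "Toronto", "Calgary"),
  salary = c(91000, 92000, 82000, 115000, 118000,
             97000, 98000, 89000, 108000, 95000)
)

# Hypothetical helper: size of the smallest group of rows that share
# the same combination of quasi-identifier values
min_group_size <- function(data, quasi_ids) {
  group_sizes <- table(data[quasi_ids])
  min(group_sizes[group_sizes > 0])
}

# Before generalization, every (age, city) combination is unique
min_group_size(raw, c("age", "city"))        # 1

# Generalize age into 10-year ranges: [30, 40) and [40, 50)
raw$age_range <- cut(raw$age, breaks = c(30, 40, 50), right = FALSE,
                     labels = c("30-39", "40-49"))

# The data are k-anonymous for any k up to this value
min_group_size(raw, c("age_range", "city"))  # 2
```

The smallest group size is the largest \(k\) the dataset satisfies, so the same helper can be rerun after each generalization or suppression step until the chosen \(k\) is reached.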

Given the data, which field(s) could you generalize to help achieve \(k = 3\) anonymity?

Age  ZIP Code  Disease
29   13053     Flu
27   13068     Flu
28   13068     Cold
45   14853     Diabetes
46   14853     Diabetes
47   14853     Cancer

  • A. Generalize Age into age ranges (e.g., 20–29, 40–49)
  • B. Suppress Disease entirely
  • C. Generalize ZIP Code to first 3 digits (e.g., 130, 148)
  • D. Generalize Age into age ranges (e.g., 20–29, 40–49) and ZIP code to first 3 digits (e.g., 130, 148)
  • E. It’s already \(k=3\) anonymous

Which of the following datasets violates \(k = 2\) anonymity?

Option A

Age Sex ZIP
34 M 02138
34 M 02138
34 F 02139

Option B

Age Sex ZIP
22 F 10011
22 F 10011
22 F 10011

Option C

Age Range  Sex  ZIP Prefix
30–39      *    021**
30–39      *    021**
30–39      *    021**

  • A. Only A
  • B. Only B
  • C. Only C
  • D. A and B

\(l\)-diversity

  • \(l\)-diversity is an extension of \(k\)-anonymity that ensures sufficient variation in a sensitive attribute.
  • This is important because if all individuals within a group share the same sensitive value, there is still a risk of inference.

  • Although these data are \(2\)-anonymous, we can still infer that any 30–39 year old from Calgary who participated earns between $90,000 and $99,999.

Age Range  City       Salary Range
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
40–49      Calgary    110,000–119,999
30–39      Vancouver  110,000–119,999
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
30–39      Toronto    100,000–109,999
40–49      Calgary    90,000–99,999

\(l\)-diversity

  • The approach requires at least \(l\) different values for the sensitive attribute within each combination of quasi-identifiers.
  • Again, there is no perfect value for \(l\) (typically \(1< l \leq k\)).

  • With \(l=2\), each combination of Age Range and City must contain at least 2 distinct Salary Ranges. Here, City is suppressed (shown as -) for the 30–39 group to achieve this.

Age Range  City     Salary Range
30–39      -        90,000–99,999
30–39      -        90,000–99,999
30–39      -        80,000–89,999
40–49      Calgary  110,000–119,999
30–39      -        110,000–119,999
30–39      -        90,000–99,999
30–39      -        90,000–99,999
30–39      -        80,000–89,999
30–39      -        100,000–109,999
40–49      Calgary  90,000–99,999
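
Checking \(l\)-diversity amounts to counting the distinct sensitive values within each quasi-identifier group. Here is a base R sketch on the salary example. The `min_distinct` helper is a hypothetical name, the salary ranges are abbreviated, and the suppressed city is coded as "-" as in the table.

```r
age_range <- c("30-39", "30-39", "30-39", "40-49", "30-39",
               "30-39", "30-39", "30-39", "30-39", "40-49")
city      <- c("Calgary", "Toronto", "Vancouver", "Calgary", "Vancouver",
               "Calgary", "Toronto", "Vancouver", "Toronto", "Calgary")
salary_range <- c("90-99k", "90-99k", "80-89k", "110-119k", "110-119k",
                  "90-99k", "90-99k", "80-89k", "100-109k", "90-99k")

# Hypothetical helper: smallest number of distinct sensitive values
# found in any quasi-identifier group
min_distinct <- function(sensitive, groups) {
  min(tapply(sensitive, groups, function(v) length(unique(v))))
}

# 2-anonymous table: the 30-39 Calgary group has a single salary range
min_distinct(salary_range, paste(age_range, city))             # 1

# Suppress city for the 30-39 group, then check again
city_suppressed <- ifelse(age_range == "30-39", "-", city)
min_distinct(salary_range, paste(age_range, city_suppressed))  # 2
```

The returned value is the largest \(l\) the table satisfies, so the first table is only 1-diverse while the second is 2-diverse.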

Consider this 3-anonymous dataset. Is it also 2-diverse with respect to “Condition”?

Age Range  ZIP Prefix  Condition
20–29      130**       Flu
20–29      130**       Flu
20–29      130**       Flu
30–39      148**       Cold
30–39      148**       Cold
30–39      148**       Cancer

  • A. Yes, both groups have 2 or more different values
  • B. No, one group violates l-diversity
  • C. Yes, because the dataset is already k-anonymous
  • D. No, both groups have only one distinct value

There are still issues…

  • Even when data are \(k\)-anonymous and \(l\)-diverse, some sensitive patterns can still leak through.

  • In the \(2\)-diverse salary example, the two individuals aged 40–49 from Calgary fall into the same group.

  • While they are in different salary ranges and exact values are hidden, the combined range is still quite narrow.

  • Due to the similarity of the salary ranges, one can still infer that both individuals earn between $90,000 and $119,999.

Age Range  City     Salary Range
40–49      Calgary  110,000–119,999
40–49      Calgary  90,000–99,999

Differential privacy

  • So, we may need more sophisticated tools to privatize our data…
  • Differential privacy is a mathematical approach to protecting privacy
  • It ensures that an algorithm’s results are nearly the same whether or not any one person’s data are included
  • Differential privacy makes it hard to tell whether any individual’s data are in the dataset, which protects individuals’ information (even for people with unusual or unique data)
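
One standard way to achieve differential privacy for a count is the Laplace mechanism, which adds Laplace noise with scale sensitivity/\(\varepsilon\) to the true answer. The sketch below is not from the talk. The privacy parameter `epsilon`, the `rlaplace` and `dp_count` helpers, and the reuse of the example ages are all assumptions for illustration. Base R has no Laplace sampler, so the noise is drawn as the difference of two exponential variables, which follows a Laplace distribution.

```r
set.seed(42)

# Laplace(0, scale) noise as the difference of two Exponential(1/scale) draws
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

# Hypothetical helper: differentially private count of TRUE values.
# Adding or removing one person changes a count by at most 1,
# so the sensitivity of a count is 1.
dp_count <- function(x, epsilon, sensitivity = 1) {
  sum(x) + rlaplace(1, scale = sensitivity / epsilon)
}

age <- c(52, 19, 48, 28)

true_count  <- sum(age < 30)                      # exact answer
noisy_count <- dp_count(age < 30, epsilon = 0.5)  # exact answer plus noise

c(true = true_count, noisy = noisy_count)
```

A smaller `epsilon` means more noise and stronger privacy. A larger `epsilon` gives answers closer to the truth, which is the privacy versus utility tradeoff again.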

Differential Privacy Example


Source: https://medium.com/data-science/a-differential-privacy-example-for-beginners-ef3c23f69401

Open Science



  • Open science is about making scientific research, data, and dissemination accessible to all.
  • It promotes transparency, collaboration, and innovation in research.
  • Includes open access publications, open data and open tools.
  • Supported by initiatives like FOSTER (Facilitate Open Science Training for European Research).

What are Open Data?



  • Open data refers to freely accessible, online data that can be used, reused, and shared with proper attribution given to the original source (FOSTER).
  • Sharing and reusing open data helps make research more transparent and reproducible.
  • Ethical considerations mean that not all data can (or should) be fully open (e.g., personal or sensitive data).

Why Open Data Matters



  • Reproducibility: Enables verification and replication of research.
  • Efficiency: Saves time and resources by reducing redundant data collection.
  • Collaboration: Allows researchers to combine datasets for new insights.
  • Innovation: Drives new discoveries and applications across disciplines.

Data Ownership & Licensing



  • Understanding data ownership is essential before sharing or licensing data.
  • Ownership depends on factors like:
    • Who collected or created the data.
    • Institutional policies and employment contracts.
    • Funding agency requirements and agreements.
    • The nature of the data - personal data may be subject to privacy laws (refer back to the Amazon example).

Balancing Openness & Ethics



  • Open science supports open data whenever ethically appropriate.
  • Some data must remain restricted due to privacy, security, or legal constraints.
  • Best practices help balance openness with responsibility.
  • Open data and open science movements often overlook the rights and interests of marginalized individuals and communities (e.g., Indigenous data).
  • The goal: “As open as possible, as closed as necessary.”

Key Takeaways

  • Data exist on a spectrum of identifiability
  • Even seemingly anonymous data can often be re-identified (e.g., mosaic effect)
  • Quasi-identifiers can lead to re-identification if not protected
  • Choosing privacy parameters involves balancing risk and data utility
  • Responsible data handling requires both technical skill and ethical awareness

Questions?